🎮 Global Video Game Sales - Data Analysis¶

📌 Project Overview¶

This project analyzes global video game sales using a dataset from Kaggle. The aim of this project is to explore things like most popular genres, title, consoles and more.

📊 Dataset Details¶

  • Source: Kaggle (Link : https://www.kaggle.com/datasets/asaniczka/video-game-sales-2024)

🔄 Data Loading & Preprocessing¶

Before diving into analysis, let us load the dataset and inspect its structure.

In [1]:
# Loading the dataset
import pandas as pd

video_games = pd.read_csv("/Users/venkat/Downloads/Global Video Game Sales/Video+Game+Sales/vgchartz-2024.csv")

# Checking the dataset
video_games.head()
Out[1]:
img title console genre publisher developer critic_score total_sales na_sales jp_sales pal_sales other_sales release_date last_update
0 /games/boxart/full_6510540AmericaFrontccc.jpg Grand Theft Auto V PS3 Action Rockstar Games Rockstar North 9.4 20.32 6.37 0.99 9.85 3.12 2013-09-17 NaN
1 /games/boxart/full_5563178AmericaFrontccc.jpg Grand Theft Auto V PS4 Action Rockstar Games Rockstar North 9.7 19.39 6.06 0.60 9.71 3.02 2014-11-18 2018-01-03
2 /games/boxart/827563ccc.jpg Grand Theft Auto: Vice City PS2 Action Rockstar Games Rockstar North 9.6 16.15 8.41 0.47 5.49 1.78 2002-10-28 NaN
3 /games/boxart/full_9218923AmericaFrontccc.jpg Grand Theft Auto V X360 Action Rockstar Games Rockstar North NaN 15.86 9.06 0.06 5.33 1.42 2013-09-17 NaN
4 /games/boxart/full_4990510AmericaFrontccc.jpg Call of Duty: Black Ops 3 PS4 Shooter Activision Treyarch 8.1 15.09 6.18 0.41 6.05 2.44 2015-11-06 2018-01-14
In [2]:
# Exploring the dataset

video_games.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 64016 entries, 0 to 64015
Data columns (total 14 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   img           64016 non-null  object 
 1   title         64016 non-null  object 
 2   console       64016 non-null  object 
 3   genre         64016 non-null  object 
 4   publisher     64016 non-null  object 
 5   developer     63999 non-null  object 
 6   critic_score  6678 non-null   float64
 7   total_sales   18922 non-null  float64
 8   na_sales      12637 non-null  float64
 9   jp_sales      6726 non-null   float64
 10  pal_sales     12824 non-null  float64
 11  other_sales   15128 non-null  float64
 12  release_date  56965 non-null  object 
 13  last_update   17879 non-null  object 
dtypes: float64(6), object(8)
memory usage: 6.8+ MB

🔧 Data Cleaning¶

We handle missing and inconsistent data, rename columns for the sake of visualization and handle any incorrect data types

In [3]:
# Check for null_values
video_games.isnull().sum()
Out[3]:
img                 0
title               0
console             0
genre               0
publisher           0
developer          17
critic_score    57338
total_sales     45094
na_sales        51379
jp_sales        57290
pal_sales       51192
other_sales     48888
release_date     7051
last_update     46137
dtype: int64
In [4]:
# Converting release_date column into datetime format

video_games['release_date'] = pd.to_datetime(video_games['release_date'],errors='coerce')

video_games.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 64016 entries, 0 to 64015
Data columns (total 14 columns):
 #   Column        Non-Null Count  Dtype         
---  ------        --------------  -----         
 0   img           64016 non-null  object        
 1   title         64016 non-null  object        
 2   console       64016 non-null  object        
 3   genre         64016 non-null  object        
 4   publisher     64016 non-null  object        
 5   developer     63999 non-null  object        
 6   critic_score  6678 non-null   float64       
 7   total_sales   18922 non-null  float64       
 8   na_sales      12637 non-null  float64       
 9   jp_sales      6726 non-null   float64       
 10  pal_sales     12824 non-null  float64       
 11  other_sales   15128 non-null  float64       
 12  release_date  56965 non-null  datetime64[ns]
 13  last_update   17879 non-null  object        
dtypes: datetime64[ns](1), float64(6), object(7)
memory usage: 6.8+ MB
In [5]:
#Renaming columns for better understanding
video_games.rename(columns={'title':'Title','console':'Console','genre':'Genre','publisher':'Publisher','developer':'Developer'},inplace=True)

video_games.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 64016 entries, 0 to 64015
Data columns (total 14 columns):
 #   Column        Non-Null Count  Dtype         
---  ------        --------------  -----         
 0   img           64016 non-null  object        
 1   Title         64016 non-null  object        
 2   Console       64016 non-null  object        
 3   Genre         64016 non-null  object        
 4   Publisher     64016 non-null  object        
 5   Developer     63999 non-null  object        
 6   critic_score  6678 non-null   float64       
 7   total_sales   18922 non-null  float64       
 8   na_sales      12637 non-null  float64       
 9   jp_sales      6726 non-null   float64       
 10  pal_sales     12824 non-null  float64       
 11  other_sales   15128 non-null  float64       
 12  release_date  56965 non-null  datetime64[ns]
 13  last_update   17879 non-null  object        
dtypes: datetime64[ns](1), float64(6), object(7)
memory usage: 6.8+ MB
In [6]:
#Extracting release_year from the release_date column

video_games = video_games.assign(release_year = lambda x:x['release_date'].dt.year)

video_games.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 64016 entries, 0 to 64015
Data columns (total 15 columns):
 #   Column        Non-Null Count  Dtype         
---  ------        --------------  -----         
 0   img           64016 non-null  object        
 1   Title         64016 non-null  object        
 2   Console       64016 non-null  object        
 3   Genre         64016 non-null  object        
 4   Publisher     64016 non-null  object        
 5   Developer     63999 non-null  object        
 6   critic_score  6678 non-null   float64       
 7   total_sales   18922 non-null  float64       
 8   na_sales      12637 non-null  float64       
 9   jp_sales      6726 non-null   float64       
 10  pal_sales     12824 non-null  float64       
 11  other_sales   15128 non-null  float64       
 12  release_date  56965 non-null  datetime64[ns]
 13  last_update   17879 non-null  object        
 14  release_year  56965 non-null  float64       
dtypes: datetime64[ns](1), float64(7), object(7)
memory usage: 7.3+ MB
In [7]:
video_games.head()
Out[7]:
img Title Console Genre Publisher Developer critic_score total_sales na_sales jp_sales pal_sales other_sales release_date last_update release_year
0 /games/boxart/full_6510540AmericaFrontccc.jpg Grand Theft Auto V PS3 Action Rockstar Games Rockstar North 9.4 20.32 6.37 0.99 9.85 3.12 2013-09-17 NaN 2013.0
1 /games/boxart/full_5563178AmericaFrontccc.jpg Grand Theft Auto V PS4 Action Rockstar Games Rockstar North 9.7 19.39 6.06 0.60 9.71 3.02 2014-11-18 2018-01-03 2014.0
2 /games/boxart/827563ccc.jpg Grand Theft Auto: Vice City PS2 Action Rockstar Games Rockstar North 9.6 16.15 8.41 0.47 5.49 1.78 2002-10-28 NaN 2002.0
3 /games/boxart/full_9218923AmericaFrontccc.jpg Grand Theft Auto V X360 Action Rockstar Games Rockstar North NaN 15.86 9.06 0.06 5.33 1.42 2013-09-17 NaN 2013.0
4 /games/boxart/full_4990510AmericaFrontccc.jpg Call of Duty: Black Ops 3 PS4 Shooter Activision Treyarch 8.1 15.09 6.18 0.41 6.05 2.44 2015-11-06 2018-01-14 2015.0
In [8]:
video_games.describe()
Out[8]:
critic_score total_sales na_sales jp_sales pal_sales other_sales release_year
count 6678.000000 18922.000000 12637.000000 6726.000000 12824.000000 15128.000000 56965.000000
mean 7.220440 0.349113 0.264740 0.102281 0.149472 0.043041 2006.359572
std 1.457066 0.807462 0.494787 0.168811 0.392653 0.126643 8.617813
min 1.000000 0.000000 0.000000 0.000000 0.000000 0.000000 1971.000000
25% 6.400000 0.030000 0.050000 0.020000 0.010000 0.000000 2001.000000
50% 7.500000 0.120000 0.120000 0.040000 0.040000 0.010000 2008.000000
75% 8.300000 0.340000 0.280000 0.120000 0.140000 0.030000 2012.000000
max 10.000000 20.320000 9.760000 2.130000 9.850000 3.120000 2024.000000

Exploratory Data Analysis¶

📈 Sales Trends Over the Years¶

Analyzing yearly trends helps us understand how the gaming industry has evolved.

In [9]:
#This dataframe calculates sum of total_sales by release_year

annual_sales = video_games.groupby('release_year',as_index=False).agg({'total_sales':'sum'})

annual_sales.head()
Out[9]:
release_year total_sales
0 1971.0 0.00
1 1973.0 0.00
2 1975.0 0.00
3 1977.0 2.50
4 1978.0 2.36
In [10]:
# Creating a line chart here to plot total sales by year

import plotly.express as px

px.line(annual_sales,x='release_year',y='total_sales')

Top 10 titles¶

In [11]:
#This dataframe contains the ten highest selling titles by total_sales ranked from highest to lowest

top10_titles = (
    video_games.groupby("Title",as_index=False).
    agg({'total_sales':'sum'}).
    sort_values('total_sales',ascending=False).
    iloc[:10]
)
top10_titles.head(10)
Out[11]:
Title total_sales
13724 Grand Theft Auto V 64.29
5266 Call of Duty: Black Ops 30.99
5281 Call of Duty: Modern Warfare 3 30.71
5273 Call of Duty: Black Ops II 29.59
5277 Call of Duty: Ghosts 28.80
5271 Call of Duty: Black Ops 3 26.72
5280 Call of Duty: Modern Warfare 2 25.02
20998 Minecraft 24.01
13719 Grand Theft Auto IV 22.53
5265 Call of Duty: Advanced Warfare 21.78
In [12]:
#Creating a bar chart of the top ten selling titles of all time

px.bar(top10_titles,x='Title',y='total_sales')

Building a Dashboard¶

In [ ]:
pip install jinja2==3.0.3
In [13]:
# Import necessary libraries
from dash import Dash, dcc, html
from dash.dependencies import Output, Input
from dash_bootstrap_templates import load_figure_template
import dash_bootstrap_components as dbc

import plotly.express as px
import pandas as pd
import numpy as np

# Load dataset, parse release_date as datetime, rename columns for clarity, and add a release_year column

video_games = (pd.read_csv("/Users/venkat/Downloads/Global Video Game Sales/Video+Game+Sales/vgchartz-2024.csv",parse_dates=['release_date']).
               rename({'title':'Title','console':'Console','genre':'Genre','publisher':'Publisher',
                       'developer':'Developer'},axis=1).
               assign(release_year = lambda x:x['release_date'].dt.year)
              )

# Initializing the Dash app

app = Dash(__name__)


# Defining the layout of the Dash app

app.layout = dbc.Container([
    html.H1("Video game explorer",style={'text-align':'center'}),
    dbc.Row([
        dbc.Col([
            dbc.Card([
                dcc.Markdown("**Select A Category**"),
                dcc.Dropdown(
                id="category-dropdown",
                options=["Title","Genre","Publisher","Developer","Console"],value="Title",className="dbc"),
            ])
        ]),
        dbc.Col([
            dbc.Card([
                dcc.Markdown("**Select A Region**"),
                dcc.RadioItems(
                id="region-radio",
                options={
                    'total_sales':'World Total',
                    'na_sales':'North America',
                    'jp_sales':"Japan",
                    'pal_sales':'Europe/Africa',
                    'other_sales':'Rest of World'}, value='total_sales',className="dbc"),
            ])
        ]),]),
    html.Br(),
    dbc.Row(dcc.Graph(id="sales-line")),
    html.Br(),
    dbc.Row(dcc.Graph(id="rankings-bar")),
])


# Defining the callback function to update both graphs based on user selections
@app.callback(
Output('sales-line','figure'),
Output('rankings-bar','figure'),
Input('category-dropdown','value'),
Input('region-radio','value'),
)

# The two charts created above have been added to the dashboard which will allow us to select title, genre, 
# publisher, developer and console with a dropdown and total_sales, jp_sales, na_sales, pal_sales 
# and other_sales with radio buttons



def vg_plotter(category, region):
    
    annual_sales=video_games.groupby("release_year",as_index=False).agg({region:"sum"})
    
    fig = px.line(
        annual_sales,
        x='release_year',
        y=region,
        title=f"Video Game Sales in {region} Over Time"
    ).update_layout(title_x=0.5)
    
    top10_sellers = (
        video_games
        .groupby(category, as_index=False)
        .agg({region:"sum"})
        .sort_values(region, ascending=False)
        .iloc[:10]
    )
    
    fig2 = px.bar(
        top10_sellers,
        x=category,
        y=region,
        title=f"Top Video Game Sales by Category"
    ).update_layout(title_x=0.5)
    
    return fig, fig2

#Run the dash app
if __name__ == "__main__":
    app.run_server()

               

🏆 Key Insights & Takeaways¶

  1. The Sports and Action genres dominate total global sales.
  2. Video game sales peaked in the year 2008, with a decline afterward.
  3. EA Canada tops the list for Video Game sales when categorised based on Developer
  4. PS2 console dominates the industry.